In this exercise, we will be using functions from the `tidyverse` package. Before you use an R package, you need to load it into your session by calling the `library` function.
It’s a good idea to load all of the packages you need in a single code chunk at the top of your R Markdown file, like this:
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0 ✔ purrr 0.3.5
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.1 ✔ stringr 1.5.0
✔ readr 2.1.3 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
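The "Conflicts" lines mean that two packages define a function with the same name, and the most recently loaded one wins. A minimal sketch of the same idea, using a made-up replacement for `sd` so that no extra packages are needed (the replacement function is purely illustrative):

```r
# Define a function with the same name as stats::sd(). It now masks the
# original, in the same way that dplyr::filter() masks stats::filter().
sd <- function(x) "my own sd()"

masked_result   <- sd(c(1, 2, 3))         # calls the new definition
original_result <- stats::sd(c(1, 2, 3))  # package::function reaches the original

print(masked_result)
print(original_result)

rm(sd)  # remove our version so that sd() means stats::sd() again
```

The `package::function` form is a way to reach a masked function explicitly, e.g. `stats::filter()` after the tidyverse has been loaded.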
The file `Olympic_100m_results.csv` contains information on all Olympic 100m sprint medalists. Open it in Excel first to see what it looks like, then read it into R using the `read_csv` function. You’ll need to give the data frame a name; choose `olympic_100m_data` if you want to be consistent with the rest of the exercise and the solutions we provide.
Hints:
- If you’re using a Windows computer, you will need to close the file in Excel before you can open it in other software, including R.
- If you get error messages about the file not being found, double-check that you have typed the name correctly, and make sure your R working directory is set correctly. The RStudio menu option Session > Set Working Directory > To Source File Location is often useful.
- If you get error messages about `read_csv` not being found, make sure you have loaded the `tidyverse` package by running the code chunks near the top of this file.
- Make sure you are using the `read_csv` function and not the `read.csv` function, as the two do not produce the same output (`read_csv` returns a tibble, which prints differently from the plain data frame returned by `read.csv`).
olympic_100m_data <- read_csv("Olympic_100m_results.csv")
Rows: 138 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): Gender, Event, Location, Medal, Name, Nationality
dbl (3): Year, Result, Time
lgl (1): Wind
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
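To see the difference between the two readers without touching the exercise file, here is a rough sketch that writes a two-row CSV to a temporary location and reads it back. The temporary file and the variable names are assumptions made for illustration; the `readr` step is guarded so the code still runs if that package is not installed.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The two rows are copied from the first rows of the exercise data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Medal,Time",
             "Usain BOLT,G,9.81",
             "Justin GATLIN,S,9.89"),
           csv_path)

# Check that the file is where R expects it before trying to read it.
file.exists(csv_path)

# Base R reader: returns a plain data.frame.
base_version <- read.csv(csv_path)
class(base_version)

# readr reader (part of the tidyverse): returns a tibble.
if (requireNamespace("readr", quietly = TRUE)) {
  readr_version <- readr::read_csv(csv_path, show_col_types = FALSE)
  print(class(readr_version))
}
```

If `file.exists()` returns `FALSE` for your own file name, the working directory is the first thing to check; `getwd()` shows where R is currently looking.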
When you load the data file, you should see its name appear in the Environment tab in RStudio (normally in the top-right panel).
How many observations and how many variables does this data frame have?
What happens when you click the blue arrow to the left of the data frame? What do you think `chr`, `num` and `logi` mean?
What happens when you click on the name of the data frame (e.g. `olympic_100m_data`)?
What do you think the data in each column represents?
You can see from the list in the top right that there are 138 observations (rows) and 10 variables (columns).
The blue arrow reveals a list of the columns in the data set and their data types (`chr` for character, i.e. text; `num` for numeric; `logi` for logical, i.e. dichotomous or binary data).
Clicking on the data frame name shows a spreadsheet-style view of the data in the top-left panel.
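The same information is available from code as well as from the Environment tab. A minimal sketch, using a small stand-in data frame (called `medalists` here, a made-up name) built from the first three rows of the exercise data:

```r
# Stand-in data frame; the real exercise data has 138 rows and 10 columns.
medalists <- data.frame(
  Name  = c("Usain BOLT", "Justin GATLIN", "Andre DE GRASSE"),
  Medal = c("G", "S", "B"),
  Time  = c(9.81, 9.89, 9.91)
)

nrow(medalists)  # number of observations (rows)
ncol(medalists)  # number of variables (columns)
dim(medalists)   # both at once: rows, then columns

# str() prints each column with its type, similar to the blue-arrow view.
str(medalists)
```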
Type the name of the data frame (e.g. `olympic_100m_data`) into the R console on the bottom-left and press enter. What do you see?
Type `glimpse(NAME_OF_DATA_FRAME)` into the R console (changing that to the name of the data frame you chose earlier). What do you see?
Now add these statements to an R code chunk. Try running them from the R Markdown document (Ctrl-Enter/Command-Enter or the green ‘play’ button on the right). Finally, knit the document into an HTML file and look at the result.
olympic_100m_data
# A tibble: 138 × 10
Gender Event Location Year Medal Name Natio…¹ Result Wind Time
<chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <lgl> <dbl>
1 M 100M Men Rio 2016 G Usain BOLT JAM 9.81 NA 9.81
2 M 100M Men Rio 2016 S Justin GATL… USA 9.89 NA 9.89
3 M 100M Men Rio 2016 B Andre DE GR… CAN 9.91 NA 9.91
4 M 100M Men Beijing 2008 G Usain BOLT JAM 9.69 NA 9.69
5 M 100M Men Beijing 2008 S Richard THO… TTO 9.89 NA 9.89
6 M 100M Men Beijing 2008 B Walter DIX USA 9.91 NA 9.91
7 M 100M Men Sydney 2000 G Maurice GRE… USA 9.87 NA 9.87
8 M 100M Men Sydney 2000 S Ato BOLDON TTO 9.99 NA 9.99
9 M 100M Men Sydney 2000 B Obadele THO… BAR 10.0 NA 10.0
10 M 100M Men Barcelona 1992 G Linford CHR… GBR 9.96 NA 9.96
# … with 128 more rows, and abbreviated variable name ¹Nationality
glimpse(olympic_100m_data)
Rows: 138
Columns: 10
$ Gender <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M"…
$ Event <chr> "100M Men", "100M Men", "100M Men", "100M Men", "100M Men"…
$ Location <chr> "Rio", "Rio", "Rio", "Beijing", "Beijing", "Beijing", "Syd…
$ Year <dbl> 2016, 2016, 2016, 2008, 2008, 2008, 2000, 2000, 2000, 1992…
$ Medal <chr> "G", "S", "B", "G", "S", "B", "G", "S", "B", "G", "S", "B"…
$ Name <chr> "Usain BOLT", "Justin GATLIN", "Andre DE GRASSE", "Usain B…
$ Nationality <chr> "JAM", "USA", "CAN", "JAM", "TTO", "USA", "USA", "TTO", "B…
$ Result <dbl> 9.81, 9.89, 9.91, 9.69, 9.89, 9.91, 9.87, 9.99, 10.04, 9.9…
$ Wind <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Time <dbl> 9.81, 9.89, 9.91, 9.69, 9.89, 9.91, 9.87, 9.99, 10.04, 9.9…
R refers to an individual column within a data frame using the notation `DATA_FRAME_NAME$COLUMN_NAME`. For example, the column `Time` in the data frame `olympic_100m_data` would be written `olympic_100m_data$Time`.
What happens when you write a line of code with just the name of a column within a data frame, e.g. `olympic_100m_data$Time`?
What happens if you misspell the name of the column? What about using uppercase and lowercase differently from the name in the data frame?
There are a number of functions built into R which operate on numeric vectors. Some useful ones include `mean`, `sd`, `min`, `max` and `median`. Try calling these on the `Time` column; e.g. `mean(olympic_100m_data$Time)`.
R can also do arithmetic calculations, like a pocket calculator or Excel formula. You can mix single numbers with vectors (columns of numbers) and R will usually do something sensible. For example, `2 * column_name` will return a vector where each element of the original column has been doubled, while `column1 + column2` will add corresponding elements of each column.
Calculate the speed of each runner in metres per second. (Hint: speed is `distance / time`, and this data frame only contains information about a 100m race.)
What is the fastest speed in metres per second? (Hint: use the `max` function.)
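As a small illustration of mixing single numbers with vectors, here is a sketch with made-up numbers (not the exercise data; the variable names are hypothetical):

```r
heights_cm <- c(150, 160, 170)    # made-up numbers for illustration
extra_cm   <- c(1, 2, 3)

doubled <- 2 * heights_cm         # the single number is applied to every element
summed  <- heights_cm + extra_cm  # elements are added pairwise

doubled
summed
```

The same pattern applies to a column of a data frame, since a column is itself a vector.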
olympic_100m_data$Time
[1] 9.81 9.89 9.91 9.69 9.89 9.91 9.87 9.99 10.04 9.96 10.02 10.04
[13] 9.99 10.19 10.22 10.06 10.08 10.14 9.90 10.00 10.00 10.20 10.20 10.30
[25] 10.40 10.40 10.40 10.30 10.40 10.50 10.80 10.90 10.90 10.80 10.80 10.90
[37] 10.80 11.00 11.10 11.20 9.63 9.75 9.79 9.85 9.86 9.87 9.84 9.89
[49] 9.90 10.25 10.25 10.39 10.14 10.24 10.33 10.00 10.20 10.20 10.50 10.50
[61] 10.60 10.30 10.40 10.60 10.30 10.30 10.40 10.60 10.70 10.80 10.80 10.90
[73] 10.90 11.00 11.20 11.20 12.00 12.20 12.60 12.60 10.71 10.83 10.86 10.78
[85] 10.98 10.98 11.12 11.18 11.19 10.82 10.83 10.84 10.97 11.13 11.16 11.08
[97] 11.13 11.17 11.00 11.10 11.10 11.00 11.30 11.30 11.50 11.80 11.90 11.50
[109] 11.70 11.90 12.20 10.75 10.78 10.81 10.93 10.96 10.97 10.94 10.94 10.96
[121] 11.06 11.07 11.14 11.07 11.23 11.24 11.40 11.60 11.60 11.50 11.70 11.70
[133] 11.90 12.20 12.20 11.90 11.90 12.00
olympic_100m_data$Tiem
Warning: Unknown or uninitialised column: `Tiem`.
NULL
olympic_100m_data$time
Warning: Unknown or uninitialised column: `time`.
NULL
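One way to avoid spelling and capitalisation slips is to check the exact column names first. The sketch below uses a small plain data frame (a made-up stand-in named `medalists`) so that it needs no packages; unlike the tibble above, a plain data frame returns `NULL` for an unknown column without printing a warning.

```r
medalists <- data.frame(
  Name = c("Usain BOLT", "Justin GATLIN"),
  Time = c(9.81, 9.89)
)

names(medalists)               # the exact spellings R expects

"Time" %in% names(medalists)   # TRUE: matches exactly
"time" %in% names(medalists)   # FALSE: R is case-sensitive
"Tiem" %in% names(medalists)   # FALSE: misspelled

is.null(medalists$time)        # TRUE: a wrong name gives NULL, not the column
```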
mean(olympic_100m_data$Time)
[1] 10.77674
sd(olympic_100m_data$Time)
[1] 0.6761183
min(olympic_100m_data$Time)
[1] 9.63
max(olympic_100m_data$Time)
[1] 12.6
median(olympic_100m_data$Time)
[1] 10.805
100 / olympic_100m_data$Time
[1] 10.193680 10.111223 10.090817 10.319917 10.111223 10.090817 10.131712
[8] 10.010010 9.960159 10.040161 9.980040 9.960159 10.010010 9.813543
[15] 9.784736 9.940358 9.920635 9.861933 10.101010 10.000000 10.000000
[22] 9.803922 9.803922 9.708738 9.615385 9.615385 9.615385 9.708738
[29] 9.615385 9.523810 9.259259 9.174312 9.174312 9.259259 9.259259
[36] 9.174312 9.259259 9.090909 9.009009 8.928571 10.384216 10.256410
[43] 10.214505 10.152284 10.141988 10.131712 10.162602 10.111223 10.101010
[50] 9.756098 9.756098 9.624639 9.861933 9.765625 9.680542 10.000000
[57] 9.803922 9.803922 9.523810 9.523810 9.433962 9.708738 9.615385
[64] 9.433962 9.708738 9.708738 9.615385 9.433962 9.345794 9.259259
[71] 9.259259 9.174312 9.174312 9.090909 8.928571 8.928571 8.333333
[78] 8.196721 7.936508 7.936508 9.337068 9.233610 9.208103 9.276438
[85] 9.107468 9.107468 8.992806 8.944544 8.936550 9.242144 9.233610
[92] 9.225092 9.115770 8.984726 8.960573 9.025271 8.984726 8.952551
[99] 9.090909 9.009009 9.009009 9.090909 8.849558 8.849558 8.695652
[106] 8.474576 8.403361 8.695652 8.547009 8.403361 8.196721 9.302326
[113] 9.276438 9.250694 9.149131 9.124088 9.115770 9.140768 9.140768
[120] 9.124088 9.041591 9.033424 8.976661 9.033424 8.904720 8.896797
[127] 8.771930 8.620690 8.620690 8.695652 8.547009 8.547009 8.403361
[134] 8.196721 8.196721 8.403361 8.403361 8.333333
max(100 / olympic_100m_data$Time)
[1] 10.38422
# or:
100 / min(olympic_100m_data$Time)
[1] 10.38422
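Both versions give the same answer because dividing 100 by a smaller time always gives a larger speed, so the fastest speed belongs to the shortest time. A rough check of that reasoning, using only the first four `Time` values printed above (the variable names are made up for this sketch):

```r
times <- c(9.81, 9.89, 9.91, 9.69)  # first four Time values from the output above

speeds <- 100 / times

fastest_index  <- which.max(speeds)  # position of the highest speed
quickest_index <- which.min(times)   # position of the shortest time

fastest_index == quickest_index      # the same runner either way
all.equal(max(speeds), 100 / min(times))
```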
The `Event` and `Medal` columns in this data frame are categorical.
How does R display these columns if you type in their name, e.g. `olympic_100m_data$Event`?
You can get a simple frequency table using the `table()` function. Try calling it on the `Event` and `Medal` columns, e.g. `table(olympic_100m_data$Event)`.
olympic_100m_data$Event
[1] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[6] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[11] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[16] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[21] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[26] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[31] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[36] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[41] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[46] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[51] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[56] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[61] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[66] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[71] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[76] "100M Men" "100M Men" "100M Men" "100M Men" "100M Men"
[81] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[86] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[91] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[96] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[101] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[106] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[111] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[116] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[121] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[126] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[131] "100M Women" "100M Women" "100M Women" "100M Women" "100M Women"
[136] "100M Women" "100M Women" "100M Women"
table(olympic_100m_data$Event)
100M Men 100M Women
80 58
olympic_100m_data$Medal
[1] "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B"
[19] "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B"
[37] "G" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S"
[55] "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S"
[73] "B" "G" "S" "B" "G" "S" "B" "B" "G" "S" "B" "G" "S" "S" "S" "S" "B" "G"
[91] "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G"
[109] "S" "B" "G" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B"
[127] "G" "S" "B" "G" "S" "B" "G" "S" "B" "G" "S" "B"
table(olympic_100m_data$Medal)
B G S
45 46 47
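`table()` can also turn counts into proportions or count two columns against each other. A minimal sketch on a made-up five-value subset (the vectors `event` and `medal` are hypothetical and much shorter than the real columns):

```r
event <- c("100M Men", "100M Men", "100M Men", "100M Women", "100M Women")
medal <- c("G", "S", "B", "G", "S")    # made-up subset for illustration

medal_counts <- table(medal)           # categories appear in alphabetical order
medal_counts

prop.table(medal_counts)               # proportions instead of counts

event_by_medal <- table(event, medal)  # two vectors give a cross-tabulation
event_by_medal
```

The alphabetical ordering is why the medal table above lists B, G, S rather than G, S, B.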
There are a few ways to get help within R. Try these out:
- Click on the name of a function in a code chunk, and press F1.
- Type `?functionname` in the R console pane (bottom left; don’t add this to your R Markdown file), e.g. `?table` (don’t type the back ticks either!)
- Type `help("functionname")` in the R console, e.g. `help("mean")`
- Type `help(package = "packagename")` in the R console to get help on an R package, e.g. `help(package = "ggplot2")`
The first three all do the same thing.
What do you notice about the documentation which comes with R?
The R documentation is terse and nerdy but comprehensive. It is okay for a reference but not very helpful when you’re first learning R.
R packages often contain a lot of different functions, and looking at the list of functions in a package is often unhelpful. There is a link “user guides, package vignettes and other documentation” in small font near the top which sometimes has more helpful help!
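The help pages and vignettes can also be reached from code. The sketch below stores the results instead of displaying them, so it runs the same way outside RStudio; `grid` is used only because it ships with every R installation, not because the exercise needs it.

```r
# help() and ? look up the same page; here the result is stored, not shown.
median_help <- help("median")
class(median_help)

# Vignettes are the long-form guides mentioned above. This lists the ones
# that come with a single package.
grid_vignettes <- vignette(package = "grid")
class(grid_vignettes)

# In an interactive session, these open the documentation directly:
# ?median
# help(package = "grid")
# vignette(package = "grid")
```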
There are two common ways to restore your R session to a fresh start.
Clear all variables in your workspace: click the ‘broom’ icon in the Environment panel (top-right) or Session > Clear Workspace in the menu.
Restart your R session: Session > Restart R in the menu.
Try both of these. What do you think the difference is? (Hint: try running the code chunk with `read_csv` after each of them.)
Restarting the R session will also clear any loaded packages (e.g. `tidyverse`) and reset your session’s working directory.
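The difference can be sketched in code. Clearing the workspace removes objects but leaves packages attached. In the example below a scratch environment (the hypothetical `workspace`) stands in for the real workspace so that nothing of yours is deleted, and `splines` is used only because it ships with R.

```r
library(splines)        # any package would do for this demonstration

workspace <- new.env()  # stand-in for the global workspace
assign("sprint_times", c(9.81, 9.89, 9.91), envir = workspace)

# Similar in spirit to the broom icon: remove every object.
rm(list = ls(workspace), envir = workspace)

objects_left <- ls(workspace)
package_still_attached <- "package:splines" %in% search()

objects_left            # no objects remain
package_still_attached  # the package is still attached
```

After clearing the workspace, a chunk that calls `read_csv` should still run because the tidyverse remains attached; after restarting R, the `library(tidyverse)` chunk has to be run again first.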
© 2022 Statistical Consulting Centre, The University of Melbourne.